18 research outputs found

    SQL Query Completion for Data Exploration

    Full text link
    Within the big data tsunami, relational databases and SQL are still there and remain unavoidable in most cases for accessing data. On the one hand, SQL is easy to use by non-specialists and makes it possible to identify pertinent initial data at the very beginning of the data exploration process. On the other hand, it is not always easy to formulate SQL queries: nowadays, it is more and more frequent to have several databases available for one application domain, some of them with hundreds of tables and/or attributes. Identifying the pertinent conditions to select the desired data, or even identifying the relevant attributes, is far from trivial. To make it easier to write SQL queries, we propose the notion of SQL query completion: given a query, it suggests additional conditions to be added to its WHERE clause. This completion is semantic, as it relies on the data from the database, unlike current completion tools that are mostly syntactic. Since the process can be repeated over and over again, until the data analyst reaches her data of interest, SQL query completion facilitates the exploration of databases. SQL query completion has been implemented in a SQL editor on top of a database management system. For the evaluation, two questions need to be studied: first, does the completion speed up the writing of SQL queries? Second, is the completion easily adopted by users? A thorough experiment has been conducted on a group of 70 computer science students divided into two groups (one with the completion and the other without) to answer those questions. The results are positive and very promising.
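
    As a rough, purely illustrative sketch of the idea (not the authors' actual completion algorithm), the following Python snippet proposes candidate WHERE conditions by looking at the values that occur in the current query's result set; the table, column, and database names in the usage comment are hypothetical.

        import sqlite3

        def suggest_conditions(conn, query, max_distinct=5):
            # Suggest extra WHERE conditions based on the data the query currently
            # returns: a naive, purely illustrative take on semantic completion.
            cur = conn.execute(query)
            columns = [d[0] for d in cur.description]
            rows = cur.fetchall()
            suggestions = []
            for i, col in enumerate(columns):
                values = {row[i] for row in rows if row[i] is not None}
                # Only columns with few distinct values yield useful conditions:
                # each value splits the current result set into a meaningful subset.
                if 1 < len(values) <= max_distinct:
                    for v in sorted(values, key=str):
                        literal = f"'{v}'" if isinstance(v, str) else str(v)
                        suggestions.append(f"{col} = {literal}")
            return suggestions

        # Hypothetical usage on an 'emp' table:
        # conn = sqlite3.connect("company.db")
        # for cond in suggest_conditions(conn, "SELECT * FROM emp WHERE salary > 30000"):
        #     print("suggested: ... AND", cond)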

    Sélection de données guidée pour les modèles prédictifs

    No full text
    Databases and machine learning (ML) have historically evolved as two separate domains: while databases are used to store and query the data, ML is devoted to inferring predictive models, clustering, etc. Despite its apparent simplicity, the “data preparation” step of ML applications turns out to be the most time-consuming step in practice. Interestingly, this step forms the bridge between databases and ML. In this setting, we raise and address three main problems related to data selection for building predictive models. First, the database usually contains more than the data of interest: how can the data the analyst wants be separated from the data she does not want? We propose to see this problem as imbalanced classification between the tuples of interest and the rest of the database, and we develop an undersampling method based on the functional dependencies of the database. Second, we discuss the writing of the query returning the tuples of interest. We propose a SQL query completion solution based on data semantics that starts from a very general query and helps an analyst refine it until she selects her data of interest. This process aims at helping the analyst design the query that will eventually select the data she requires. Third, assuming the data has successfully been extracted from the database, the next natural question follows: is the selected data suited to answer the considered ML problem? Since a predictive model mapping the features to the class to predict is essentially a function, we point out that it makes sense to first assess the existence of that function in the data. This existence can be studied through the prism of functional dependencies, and we show how they can be used to understand a model's limitations and to refine the initial data selection if necessary.
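
    A minimal sketch of the third point, assuming that "the function exists" is read as the functional dependency features -> class holding in the selected data; the column names are hypothetical and this is not the thesis's implementation.

        from collections import defaultdict

        def fd_holds(rows, feature_cols, class_col):
            # Check the functional dependency feature_cols -> class_col: identical
            # feature values must always carry the same class label.
            labels_per_key = defaultdict(set)
            for row in rows:                       # rows: iterable of dicts
                key = tuple(row[c] for c in feature_cols)
                labels_per_key[key].add(row[class_col])
            conflicts = {k: v for k, v in labels_per_key.items() if len(v) > 1}
            return len(conflicts) == 0, conflicts

        # Hypothetical usage:
        # ok, conflicts = fd_holds(selected_rows, ["age", "income"], "churn")
        # If the FD does not hold, no classifier can be exact on these features,
        # which may call for refining the initial data selection.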

    Procédé de complétion de requêtes SQL

    No full text

    A First Experimental Study on Functional Dependencies for Imbalanced Datasets Classification

    No full text
    Imbalanced datasets are a recurring problem in classification, as most real-life datasets present classes that are not evenly distributed. This causes many problems for classification algorithms trained on such datasets, as they are often biased towards the majority class. Moreover, the minority class is often of more interest to the data scientist, while at the same time being the hardest to predict. Many different approaches have been proposed to tackle the problem of imbalanced datasets: they often rely on sampling of the majority class, or on the creation of synthetic examples for the minority one. In this paper, we take a completely different perspective on this problem: we propose to use a notion of distance between databases to sample from the majority class, so that the minority and majority classes are as distant as possible. The chosen distance is based on functional dependencies, with the intuition of capturing inherent constraints of the database. We propose algorithms to generate distant synthetic datasets, as well as experiments to verify our conjecture on the classification of distant instances. Despite the mixed results obtained so far, we believe this is a promising research direction at the intersection of machine learning and databases, and it deserves further investigation.
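
    As a loose illustration of the approach (not the paper's algorithm), one crude way to turn functional dependencies into a distance is to count, over a fixed set of candidate FDs, how many are satisfied in one sample but violated in the other, and then to undersample the majority class so as to keep that distance high; all function names and data layouts below are assumptions.

        def satisfies_fd(rows, lhs, rhs):
            # True if the functional dependency lhs -> rhs holds in rows (list of dicts).
            seen = {}
            for row in rows:
                key = tuple(row[a] for a in lhs)
                if seen.setdefault(key, row[rhs]) != row[rhs]:
                    return False
            return True

        def fd_distance(sample_a, sample_b, candidate_fds):
            # Crude distance: number of candidate FDs on which the two samples disagree.
            return sum(satisfies_fd(sample_a, lhs, rhs) != satisfies_fd(sample_b, lhs, rhs)
                       for (lhs, rhs) in candidate_fds)

        def undersample_majority(majority, minority, candidate_fds, target_size):
            # Greedily drop the majority tuple whose removal best preserves the distance
            # between the remaining majority sample and the minority class.
            kept = list(majority)
            while len(kept) > target_size:
                best = max(range(len(kept)),
                           key=lambda i: fd_distance(kept[:i] + kept[i + 1:], minority, candidate_fds))
                kept.pop(best)
            return kept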

    SQL query extensions for imprecise questions

    No full text
    Within the big data tsunami, relational databases and SQL remain inescapable in most cases for accessing data. While SQL is easy to use and has proved its robustness over the years, it is not always easy to formulate SQL queries, as it is more and more frequent to have databases with hundreds of tables and/or attributes. Identifying the pertinent conditions to select the desired data, or even the relevant attributes, is not trivial, especially when the user only has an imprecise question in mind and is not sure how to translate its conditions directly into SQL. To make it easier to write SQL queries when the initial question is imprecise, we propose SQL query extensions: given a query, the system suggests several possible additional selection clauses to complete the WHERE clause of the query, as a form of semantic SQL query autocompletion. This is helpful both for understanding the initial query's results and for refining the query to reach the desired tuples. The process is iterative, as a query constructed using an extension can also be completed. It is also adaptable, as the number of extensions to compute is flexible. A prototype has been implemented in a SQL editor on top of a database management system, and two types of evaluation are proposed. The first looks at the scaling of the system with a large number of tuples. Then a user study examines two questions: does the extension tool speed up the writing of SQL queries? And is it easily adopted by users? A thorough experiment was conducted on a group of 70 computer science students divided into two groups (one with the extension tool and the other without) to answer those questions. In the end, the results showed a faster answering time for the students who could use the extensions: 32 minutes on average to complete the test for the group with extensions, against 48 minutes for the others.
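
    The iterative use described above could look roughly like the loop below, where suggest_extensions stands in for the prototype's extension computation; its name and interface are assumptions, not the actual tool's API.

        def explore(conn, initial_query, suggest_extensions, k=3):
            # Interactive refinement: repeatedly show k candidate extensions and let
            # the analyst pick one, until she is satisfied with the current query.
            query = initial_query
            while True:
                print("Current query:", query)
                extensions = suggest_extensions(conn, query, k)
                if not extensions:
                    return query
                for i, cond in enumerate(extensions):
                    print(f"  [{i}] ... AND {cond}")
                choice = input("Pick an extension number, or press Enter to stop: ")
                if not choice.strip():
                    return query
                query = f"{query} AND {extensions[int(choice)]}"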
